148 research outputs found
Selecting Near-Optimal Learners via Incremental Data Allocation
We study a novel machine learning (ML) problem setting of sequentially
allocating small subsets of training data amongst a large set of classifiers.
The goal is to select a classifier that will give near-optimal accuracy when
trained on all data, while also minimizing the cost of misallocated samples.
This is motivated by large modern datasets and ML toolkits with many
combinations of learning algorithms and hyper-parameters. Inspired by the
principle of "optimism under uncertainty," we propose an innovative strategy,
Data Allocation using Upper Bounds (DAUB), which robustly achieves these
objectives across a variety of real-world datasets.
We further develop substantial theoretical support for DAUB in an idealized
setting where the expected accuracy of a classifier trained on samples can
be known exactly. Under these conditions we establish a rigorous sub-linear
bound on the regret of the approach (in terms of misallocated data), as well as
a rigorous bound on suboptimality of the selected classifier. Our accuracy
estimates using real-world datasets only entail mild violations of the
theoretical scenario, suggesting that the practical behavior of DAUB is likely
to approach the idealized behavior.Comment: AAAI-2016: The Thirtieth AAAI Conference on Artificial Intelligenc
Answering Complex Questions Using Open Information Extraction
While there has been substantial progress in factoid question-answering (QA),
answering complex questions remains challenging, typically requiring both a
large body of knowledge and inference techniques. Open Information Extraction
(Open IE) provides a way to generate semi-structured knowledge for QA, but to
date such knowledge has only been used to answer simple questions with
retrieval-based methods. We overcome this limitation by presenting a method for
reasoning with Open IE knowledge, allowing more complex questions to be
handled. Using a recently proposed support graph optimization framework for QA,
we develop a new inference model for Open IE, in particular one that can work
effectively with multiple short facts, noise, and the relational structure of
tuples. Our model significantly outperforms a state-of-the-art structured
solver on complex questions of varying difficulty, while also removing the
reliance on manually curated knowledge.Comment: Accepted as short paper at ACL 201
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
We present a new kind of question answering dataset, OpenBookQA, modeled
after open book exams for assessing human understanding of a subject. The open
book that comes with our questions is a set of 1329 elementary level science
facts. Roughly 6000 questions probe an understanding of these facts and their
application to novel situations. This requires combining an open book fact
(e.g., metals conduct electricity) with broad common knowledge (e.g., a suit of
armor is made of metal) obtained from other sources. While existing QA datasets
over documents or knowledge bases, being generally self-contained, focus on
linguistic understanding, OpenBookQA probes a deeper understanding of both the
topic---in the context of common knowledge---and the language it is expressed
in. Human performance on OpenBookQA is close to 92%, but many state-of-the-art
pre-trained QA methods perform surprisingly poorly, worse than several simple
neural baselines we develop. Our oracle experiments designed to circumvent the
knowledge retrieval bottleneck demonstrate the value of both the open book and
additional facts. We leave it as a challenge to solve the retrieval problem in
this multi-hop setting and to close the large gap to human performance.Comment: Published as conference long paper at EMNLP 201
Closing the Gap Between Short and Long XORs for Model Counting
Many recent algorithms for approximate model counting are based on a
reduction to combinatorial searches over random subsets of the space defined by
parity or XOR constraints. Long parity constraints (involving many variables)
provide strong theoretical guarantees but are computationally difficult. Short
parity constraints are easier to solve but have weaker statistical properties.
It is currently not known how long these parity constraints need to be. We
close the gap by providing matching necessary and sufficient conditions on the
required asymptotic length of the parity constraints. Further, we provide a new
family of lower bounds and the first non-trivial upper bounds on the model
count that are valid for arbitrarily short XORs. We empirically demonstrate the
effectiveness of these bounds on model counting benchmarks and in a
Satisfiability Modulo Theory (SMT) application motivated by the analysis of
contingency tables in statistics.Comment: The 30th Association for the Advancement of Artificial Intelligence
(AAAI-16) Conferenc
Transformers Can Be Translated to First-Order Logic with Majority Quantifiers
Characterizing the implicit structure of the computation within neural
networks is a foundational problem in the area of deep learning
interpretability. Can their inner decision process be captured symbolically in
some familiar logic? We show that any transformer neural network can be
translated into an equivalent fixed-size first-order logic formula which may
also use majority quantifiers. The idea is to simulate transformers with highly
uniform threshold circuits and leverage known theoretical connections between
circuits and logic. Our findings also reveal the surprising fact that the
entire transformer computation can be reduced merely to the division of two
(large) integers. While our results are most pertinent for transformers, they
apply equally to a broader class of neural network architectures, namely those
with a fixed-depth uniform computation graph made up of standard neural net
components, which includes feedforward and convolutional networks
The Parallelism Tradeoff: Limitations of Log-Precision Transformers
Despite their omnipresence in modern NLP, characterizing the computational
power of transformer neural nets remains an interesting open question. We prove
that transformers whose arithmetic precision is logarithmic in the number of
input tokens (and whose feedforward nets are computable using space linear in
their input) can be simulated by constant-depth logspace-uniform threshold
circuits. This provides insight on the power of transformers using known
results in complexity theory. For example, if (i.e.,
not all poly-time problems can be solved using logarithmic space), then
transformers cannot even accurately solve linear equalities or check membership
in an arbitrary context-free grammar with empty productions. Our result
intuitively emerges from the transformer architecture's high parallelizability.
We thus speculatively introduce the idea of a fundamental parallelism tradeoff:
any model architecture as parallelizable as the transformer will obey
limitations similar to it. Since parallelism is key to training models at
massive scale, this suggests a potential inherent weakness of the scaling
paradigm.Comment: Accepted at TACL. Formerly entitled "Log-Precision Transformers are
Constant-Depth Threshold Circuits". Updated with minor corrections in Section
2 (Implications) on March 6, 2023. Update with minor edits to the proof of
Lemma 3 on April 26, 202
The Expressive Power of Transformers with Chain of Thought
Recent theoretical work has identified surprisingly simple reasoning
problems, such as checking if two nodes in a graph are connected or simulating
finite-state machines, that are provably unsolvable by standard transformers
that answer immediately after reading their input. However, in practice,
transformers' reasoning can be improved by allowing them to use a "chain of
thought" or "scratchpad", i.e., generate and condition on a sequence of
intermediate tokens before answering. Motivated by this, we ask: Does such
intermediate generation fundamentally extend the computational power of a
decoder-only transformer? We show that the answer is yes, but the amount of
increase depends crucially on the amount of intermediate generation. For
instance, we find that transformer decoders with a logarithmic number of
decoding steps (w.r.t. the input length) push the limits of standard
transformers only slightly, while a linear number of decoding steps adds a
clear new ability (under standard complexity conjectures): recognizing all
regular languages. Our results also imply that linear steps keep transformer
decoders within context-sensitive languages, and polynomial steps make them
recognize exactly the class of polynomial-time solvable problems -- the first
exact characterization of a type of transformers in terms of standard
complexity classes. Together, our results provide a nuanced framework for
understanding how the length of a transformer's chain of thought or scratchpad
impacts its reasoning power.Comment: 9-page preprin
- …